IN THE CLAIMS:

Kindly replace the claims of record with the following full set of claims:

1. (Previously presented) An audio-visual system for processing video data
comprising: an object detection module capable of providing a plurality of object features
from the video data; an audio processor module capable of providing a plurality of audio
features from the video data; a processor coupled to the object detection and the audio
segmentation modules, arranged to determine a maximum correlation value among a
plurality of correlation values between the plurality of object features and the plurality of
audio features, wherein said correlation values are determined as the sum of the elements
of a subset between said audio features and selected object features.

2. (Original) The system of claim 1, wherein the processor is further arranged to
determine whether an animated object in the video data is associated with audio.

3. (Currently amended) The system of claim [[2]] 1, wherein the plurality of audio
features comprises

two or more of the following: average energy, pitch, zero crossing, bandwidth,
band central, roll off, low ratio, spectral flux and 12 MFCC components.

4. (Original) The system of claim 2, wherein the animated object is a face and the
processor is arranged to determine whether the face is speaking.

5. (Previously presented) The system of claim 4, wherein the plurality of object features
are eigenfaces that represent global features of the face.

6. (Currently amended) The system of claim 1, further comprising:

a latent semantic indexing module coupled to the processor and that preprocesses 
the plurality of object features and the plurality of audio features before the correlation is 
performed. 

7. (Original) The system of claim 6, wherein the latent semantic indexing module
includes a singular value decomposition module.

8. (Previously presented) A method for identifying a speaking person within video data,
the method comprising the steps of: 

receiving video data including image and audio information; 

determining a plurality of face image features from one or more faces in the video
data;

determining a plurality of audio features related to audio information; 

calculating correlation values between the plurality of face image features and the
audio features, wherein said correlation values are determined as the sum of the elements
of a subset between said audio features and selected object feature; and

determining the speaking person based on a maximum of the correlation values. 
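
For illustration only, and not as part of the claim language, the calculating and determining steps of claim 8 can be sketched as follows. This is one possible reading: each face is scored by summing the pairwise correlations between its feature tracks and the audio feature tracks, and the face with the maximum score is selected. The use of Pearson correlation, the helper names (`pearson`, `face_audio_score`, `pick_speaker`) and the toy data are all assumptions made for the sketch.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def face_audio_score(face_tracks, audio_tracks):
    """Sum the correlations over the (face feature, audio feature) sub-block."""
    return sum(pearson(f, a) for f in face_tracks for a in audio_tracks)

def pick_speaker(faces, audio_tracks):
    """Return the face name with the maximum summed correlation, plus all scores."""
    scores = {name: face_audio_score(tracks, audio_tracks)
              for name, tracks in faces.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy data: one audio feature and one feature per face, over five frames.
audio_tracks = [[0.1, 0.9, 0.2, 0.8, 0.1]]
faces = {
    "face_a": [[1.0, 9.0, 2.0, 8.0, 1.0]],  # moves in step with the audio
    "face_b": [[5.0, 4.0, 5.0, 5.0, 4.0]],  # does not follow the audio
}
speaker, scores = pick_speaker(faces, audio_tracks)
```

In this toy case `speaker` is `"face_a"`, since its single feature track is a scaled copy of the audio track.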

9. (Original) The method according to claim 8, further comprising the step of normalizing
the face image features and the audio features.

10. (Original) The method according to claim 9, further comprising the step of performing
a singular value decomposition on the normalized face image features and the audio
features.

11. (Original) The method according to claim 8, wherein the determining step includes
determining the speaking person based upon the one or more faces that has the largest
correlation.

12. (Original) The method according to claim 10, wherein the calculating step includes
forming a matrix of the face image features and the audio features. 

13. (Currently amended) The method according to claim 12, further comprising the step of

performing an optimal approximate fit using smaller matrices as compared to full
rank matrices formed by the face image features and the audio features.

14. (Original) The method according to claim 13, wherein the rank of the smaller
matrices is chosen to remove noise and unrelated information from the full rank matrices.
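
For illustration only, and not as part of the claim language, the normalization, singular value decomposition and reduced-rank fit of claims 9 to 14 can be sketched with NumPy. The sketch assumes a matrix whose rows are frames and whose columns are the face image and audio features, and it uses a truncated SVD, which gives the best rank-k approximation in the Frobenius norm. The claims do not state which norm or rank-selection rule is intended, so the function name `low_rank_fit`, the `rank` parameter and the random data are assumptions.

```python
import numpy as np

def low_rank_fit(features, rank):
    """Normalize each feature column, then return a rank-`rank` fit via truncated SVD."""
    x = np.asarray(features, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0                      # leave constant columns unscaled
    x = (x - x.mean(axis=0)) / std           # zero mean, unit variance per feature
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    # Keep only the `rank` largest singular values and their vectors.
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

# Hypothetical data: 20 frames, 6 combined face and audio features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(20, 6))
approx = low_rank_fit(frames, rank=2)
```

Dropping the smaller singular values is what discards the weaker directions in the data. Whether those directions correspond to noise depends on the chosen rank.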

15. (Previously presented) A memory medium including code for processing a video
including images and audio, the code comprising:

code to obtain a plurality of object features from the video; code to obtain a 
plurality of audio features from the video; 

code to determine correlation values between the plurality of object features and
the plurality of audio features, wherein said correlation values are determined as the sum
of the elements of a subset between said audio features and selected object feature; and

code to determine an association between one or more objects in the video and the
audio based on a maximum of the correlation values.

16. (Original) The memory medium of claim 15, wherein the one or more objects
comprises one or more faces.

17. (Original) The memory medium of claim 16, further comprising code to determine a
speaking face.

18. (Currently amended) The memory medium of claim 15, further comprising:

code to create a matrix using the plurality of object features and the audio features 
and code to perform a singular value decomposition on the matrix. 

19. (Currently amended) The memory medium of claim 18, further comprising:

code to perform an optimal approximate fit using smaller matrices as compared
to full rank matrices formed by the object features and the audio features.

20. (Original) The memory medium according to claim 19, wherein the rank of the
smaller matrices is chosen to remove noise and unrelated information from the full rank
matrices.